Scaled-out transport as connection proxy for device-to-device communications

ABSTRACT

Techniques are described for providing a scaled-out transport supported by interconnected data processing units (DPUs) that operates as a single system bus connection proxy for device-to-device communications within a data center. As one example, this disclosure describes techniques for providing a Peripheral Component Interconnect Express (PCIe) proxy for device-to-device communications employing the PCIe standard. The disclosed techniques include adding PCIe proxy logic on top of a host unit of a DPU to expose a PCIe proxy model to application processors, storage devices, network interface controllers, field programmable gate arrays, or other PCIe endpoint devices. The PCIe proxy model may be implemented as a physically distributed Ethernet-based switch fabric with PCIe proxy logic at the edge and fronting the PCIe endpoint devices. The interconnected DPUs and the distributed Ethernet-based switch fabric together provide a reliable, low-latency, and scaled-out transport that operates as a PCIe proxy.

This application is a continuation of U.S. patent application Ser. No. 17/248,828, filed Feb. 9, 2021, which claims the benefit of U.S. Provisional Appl. No. 62/975,033, filed Feb. 11, 2020, the entire content of each of which is incorporated herein by reference.

TECHNICAL FIELD

This disclosure relates to communications, and more specifically, to scale out of a system bus connection across a data center fabric.

BACKGROUND

In a typical cloud-based data center, a large collection of interconnected servers provides computing and/or storage capacity for execution of various applications. For example, a data center may comprise a facility that hosts applications and services for subscribers, i.e., customers of the data center. The data center may, for example, host all of the infrastructure equipment, such as compute nodes, networking and storage systems, power systems, and environmental control systems. In most data centers, clusters of storage systems and application servers are interconnected via a high-speed switch fabric provided by one or more tiers of physical network switches and routers. Data centers vary greatly in size, with some public data centers containing hundreds of thousands of servers, and may be distributed across multiple geographies for redundancy.

Such networks include devices that may be physically close to each other, such as a collection of servers and/or other devices located within a data center or within a data center rack, and that may have a need to communicate with each other directly. A number of techniques have been used for such communications, including those using device-to-device communications employing the Peripheral Component Interconnect Express (PCIe) standard. While PCIe has been and may continue to be used for device-to-device communications, PCIe was developed as a high-speed serial computer system bus standard for communications over very short distances between devices within the same system. Although it is possible for PCIe to be used for communications between devices not within the same system, such an arrangement is not always optimal. The communication speeds in such an arrangement might not be sufficiently high, and the cost and/or availability of the hardware required to implement such a solution can also be a limitation.

SUMMARY

In general, this disclosure describes techniques for providing a scaled-out transport supported by interconnected data processing units (DPUs) that operates as a single system bus connection proxy for device-to-device communications within a data center. As one example, this disclosure describes techniques for providing a Peripheral Component Interconnect Express (PCIe) proxy for device-to-device communications employing the PCIe standard. In accordance with the techniques described in this disclosure, the PCIe proxy supports disaggregation of resources within the data center by operating as either a virtual PCIe switch or a virtual PCIe device.

In one example, the techniques provide a physical approach to disaggregation in which the interconnected DPUs operate as a locally attached PCIe switch used to connect a PCIe host device to a remotely located PCIe endpoint device using PCIe over fabric. As another example, the techniques provide a logical approach to disaggregation in which at least one of the DPUs may operate as a locally attached PCIe device that is in effect a virtual device used to abstract one or more physical PCIe endpoint devices that are locally or remotely attached to the DPU.

The disclosed techniques include adding PCIe proxy logic on top of a host unit of a DPU to expose a PCIe proxy model to application processors (i.e., compute nodes), storage devices (i.e., storage nodes), network interface controllers (NICs), field programmable gate arrays (FPGAs), or other end PCIe devices (e.g., PCIe host and PCIe endpoint devices). In some examples, the PCIe proxy logic implemented on the DPU exposes both local PCIe device functionality and PCIe switch functionality for remotely attached PCIe devices. The PCIe proxy model may be implemented as a physically distributed Ethernet-based switch fabric with PCIe proxy logic at the edge and fronting the end PCIe devices. In accordance with the disclosed techniques, the interconnected DPUs and the distributed Ethernet-based switch fabric together provide a reliable, low-latency, and scaled-out transport that operates as a PCIe proxy. The scaled-out transport is transparent to the end PCIe devices (i.e., it is logically a locally attached PCIe switch or a locally attached PCIe device from the end PCIe devices' perspectives) as long as the reliability and latency provided by the scaled-out transport are substantially similar to that provided by PCIe. The host unit of the DPU may comprise a PCIe controller configured to support both non-volatile memory express (NVMe) storage nodes (e.g., SSDs) and other end PCIe devices such as compute nodes (e.g., devices including CPUs and GPUs), NICs, and FPGAs.

The techniques described in this disclosure further include a tunnel transport protocol used by the interconnected DPUs of the scaled-out transport to maintain the capabilities of a PCIe switch and make any PCIe devices exposed by the DPUs appear to be locally attached to the PCIe host device. The PCIe proxy logic implemented on each of the DPUs converts between PCIe and Ethernet in which multiple PCIe transaction layer packets (TLPs) may be included in each Ethernet frame. More specifically, the PCIe proxy logic supports tunneling PCIe over the scaled-out transport using a tunnel transport protocol over Internet Protocol (IP) over Ethernet encapsulation. Tunneling PCIe using the tunnel transport protocol over IP over Ethernet, as opposed to assigning a new Ethertype as in the case of PCIe over Ethernet, enables layer 3 (L3) routing to occur within the scaled-out transport. The PCIe proxy logic also supports reliable transmission of the encapsulated packets within the scaled-out transport, and maintains PCIe ordering and deadlock prevention solutions. The PCIe proxy logic may further support security within the transport by using encrypted and authenticated tunnels. Moreover, the PCIe proxy logic may provide hot plug support with dynamic provisioning and allocation for graceful linking and unlinking of PCIe endpoint devices. The PCIe proxy logic may further enable remote direct memory access (RDMA) from any RDMA capable device connected to the network fabric to proxied PCIe endpoint devices.

The techniques described in this disclosure enable disaggregation of application processors, storage devices, network interface controllers (NICs), field programmable gate arrays (FPGAs), or other PCIe endpoint devices connected via the PCIe proxy. For example, the PCIe proxy described herein may decouple the conventional static allocation of graphics processing units (GPUs) to specific central processing units (CPUs). In accordance with the described techniques, GPUs may be pooled in a data center and dynamically shared across multiple compute nodes and/or shared across multiple customers. The PCIe proxy may be positioned between a compute node comprising a CPU and the pool of GPUs located anywhere in the data center. The PCIe proxy supports dynamic allocation and provisioning of a GPU from the pool of GPUs to the CPU of the compute node such that the allocated GPU appears to be a locally attached device from the perspective of the CPU.

In one example, this disclosure is directed to a network system comprising a plurality of DPUs interconnected via a network fabric, wherein each DPU of the plurality of DPUs implements proxy logic for a system bus connection, and wherein the plurality of DPUs and the network fabric together operate as a single system bus connection proxy; a host device locally attached to a host unit interface of a first DPU of the plurality of DPUs via a first system bus connection; and a plurality of endpoint devices locally attached to host unit interfaces of one or more second DPUs of the plurality of DPUs via second system bus connections. The first DPU is configured to, upon receipt of packets from the host device on the host unit interface of the first DPU and destined for a given endpoint device of the plurality of endpoint devices, establish a logical tunnel across the network fabric between the first DPU and one of the second DPUs to which the given endpoint device is locally attached, encapsulate the packets using a transport protocol, and send the encapsulated packets over the logical tunnel to the one of the second DPUs. The one of the second DPUs is configured to, upon receipt of the encapsulated packets, extract the packets and send the packets on a host unit interface of the one of the second DPUs to the given endpoint device.

In another example, this disclosure is directed to a first DPU integrated circuit comprising a networking unit interconnected with a plurality of DPUs via a network fabric; a host unit comprising a host unit interface locally attached to a host device via a system bus connection; and at least one processing core. The at least one processing core is configured to execute proxy logic for a system bus connection, wherein the plurality of DPUs, including the first DPU integrated circuit, and the network fabric together operate as a single system bus connection proxy, and wherein the host unit interface is configured to provide access to the single system bus connection proxy operating as at least one of a virtual switch attached to one or more of a plurality of endpoint devices or a virtual device implemented as an abstraction of one or more of the plurality of endpoint devices; and upon receipt of packets from the host device on the host unit interface and destined for a given endpoint device of the plurality of endpoint devices, establish a logical tunnel across the network fabric between the first DPU integrated circuit and a second DPU integrated circuit of the plurality of DPUs to which the given endpoint device is locally attached, encapsulate the packets using a transport protocol, and send the encapsulated packets over the logical tunnel to the second DPU integrated circuit.

In a further example, this disclosure is directed to a method comprising configuring, by a first DPU of a plurality of DPUs interconnected via a network fabric and implementing proxy logic for a system bus connection, a host unit interface of the first DPU to operate in a first mode for a system bus connection by which the host unit interface is locally attached to a host device, wherein the plurality of DPUs and the network fabric together operate as a single system bus connection proxy, and wherein the host unit interface of the first DPU is configured to provide access to the single system bus connection proxy operating as at least one of a virtual switch attached to one or more of a plurality of endpoint devices or as a virtual device implemented as an abstraction of one or more of the plurality of endpoint devices. The method further comprises receiving, on the host unit interface of the first DPU, packets from the host device on the host unit interface, wherein the packets are destined for a given endpoint device of the plurality of endpoint devices; establishing a logical tunnel across the network fabric between the first DPU and a second DPU of the plurality of DPUs to which the given endpoint device is locally attached; encapsulating the packets using a transport protocol; and sending the encapsulated packets over the logical tunnel to the second DPU.

The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram illustrating an example network having a data center in which examples of the techniques described herein may be implemented.

FIGS. 2A-2B are block diagrams illustrating various example implementations of a PCIe proxy for device-to-device communications, in accordance with the techniques of this disclosure.

FIG. 3 is a block diagram illustrating a system including an example data processing unit communicatively coupled to an example application processor via a PCIe connection.

FIG. 4 is a block diagram illustrating an example data processing unit, in accordance with the techniques of this disclosure.

FIG. 5 is a block diagram illustrating an example host unit of the data processing unit from FIG. 4, in accordance with the techniques of this disclosure.

FIG. 6 is a flow diagram illustrating an example operation for converting between PCIe and Ethernet in a data processing unit, in accordance with the techniques of this disclosure.

DETAILED DESCRIPTION

FIG. 1 is a block diagram illustrating an example network 8 having a data center 10 in which examples of the techniques described herein may be implemented. This disclosure describes techniques for providing a scaled-out transport supported by interconnected data processing units (DPUs) 17 that operates as a single system bus connection proxy for device-to-device communications between storage nodes 12 and/or compute nodes 13 within a data center 10. The disclosed techniques enable disaggregation of application processors, storage devices, network interface controllers (NICs), field programmable gate arrays (FPGAs), or other endpoint devices connected via the scaled-out transport. In the example of FIG. 1, various data structures and processing techniques are described with respect to DPUs 17 within data center 10. Other devices within a network, such as routers, switches, servers, firewalls, gateways, and the like, having multiple core processor systems may readily be configured to utilize the data processing techniques described herein.

Data center 10 represents an example of a system in which various techniques described herein may be implemented. In general, data center 10 provides an operating environment for applications and services for customers 11 coupled to the data center by service provider network 7 and gateway device 20. In other examples, service provider network 7 may be a data center wide-area network (DC WAN), private network, or other type of network. Data center 10 may, for example, host infrastructure equipment, such as compute nodes, networking and storage systems, redundant power supplies, and environmental controls. Service provider network 7 may be coupled to one or more networks administered by other providers, and may thus form part of a large-scale public network infrastructure, e.g., the Internet.

In some examples, data center 10 may represent one of many geographically distributed network data centers. In the example of FIG. 1, data center 10 is a facility that provides information services for customers 11. Customers 11 may be collective entities such as enterprises and governments or individuals. For example, a network data center may host web services for several enterprises and end users. Other exemplary services may include data storage, virtual private networks, file storage services, data mining services, scientific- or super-computing services, and so on.

Software-defined networking (SDN) controller 21 provides a high-level controller for configuring and managing the routing and switching infrastructure of data center 10. SDN controller 21 provides a logically and in some cases physically centralized controller for facilitating operation of one or more virtual networks within data center 10. In some examples, SDN controller 21 may operate in response to configuration input received from a network administrator. Although not shown, data center 10 may also include, for example, one or more non-edge switches, routers, hubs, gateways, security devices such as firewalls, intrusion detection, and/or intrusion prevention devices, servers, computer terminals, laptops, printers, databases, wireless mobile devices such as cellular phones or personal digital assistants, wireless access points, bridges, cable modems, application accelerators, or other network devices.

In the example of FIG. 1, data center 10 includes a set of storage nodes 12 and compute nodes 13 interconnected via a high-speed network fabric 14. In some examples, storage nodes 12 and compute nodes 13 are arranged into multiple different groups, each including any number of nodes up to, for example, n storage nodes 12₁-12ₙ and m compute nodes 13₁-13ₘ (collectively, “storage nodes 12” and “compute nodes 13”). Storage nodes 12 and compute nodes 13 provide storage and computation facilities, respectively, for applications and data associated with customers 11 and may be physical (bare-metal) servers, virtual machines running on physical servers, virtualized containers running on physical servers, or combinations thereof.

As illustrated, each of storage nodes 12 and compute nodes 13 is coupled to network fabric 14 by a data processing unit (DPU) 17 for processing streams of information, such as network packets or storage packets. In example implementations, DPUs 17 are configurable to operate in a standalone network appliance having one or more DPUs. For example, DPUs 17 may be arranged into multiple different DPU groups 19, each including any number of DPUs up to, for example, x DPUs 17₁-17ₓ. In other examples, each DPU may be implemented as a component (e.g., electronic chip) within a device, such as a compute node, storage node, or application server, and may be deployed on a motherboard of the device or within a removable card, such as a storage and/or network interface card.

In general, each DPU group 19 may be configured to operate as a high-performance I/O hub designed to aggregate and process network and/or storage I/O for multiple storage nodes 12 and compute nodes 13. As described above, the set of DPUs 17 within each of the DPU groups 19 provides highly-programmable, specialized I/O processing circuits for handling networking and communications operations on behalf of storage nodes 12 and compute nodes 13.

As further described herein, in one example, each DPU 17 is a highly programmable I/O processor specially designed for offloading certain functions from storage nodes 12 and compute nodes 13. In one example, each DPU 17 includes a number of internal processor clusters, each including two or more processing cores and equipped with hardware engines that offload cryptographic functions, compression, and regular expression (RegEx) processing, data storage functions including deduplication and erasure coding, and networking operations. In this way, each DPU 17 includes components for fully implementing and processing network and storage stacks on behalf of one or more storage nodes 12 or compute nodes 13. In addition, each DPU 17 may be programmatically configured to serve as a security gateway for its respective storage nodes 12 and/or compute nodes 13, freeing up the processors of the nodes to dedicate resources to application workloads. In some example implementations, each DPU 17 may be viewed as a network interface subsystem that implements full offload of the handling of data packets (with zero copy in server memory) and storage acceleration for the attached nodes. In one example, each DPU 17 may be implemented as one or more application-specific integrated circuits (ASICs) or other hardware and software components, each supporting a subset of the storage nodes 12 and/or compute nodes 13.

DPUs 17 may also be referred to as access nodes, or devices including access nodes. In other words, the term access node may be used herein interchangeably with the term DPU. Additional example details of various example DPUs and access nodes are described in U.S. Pat. No. 10,659,254, issued May 19, 2020 (Attorney Docket No. 1242-005US01); U.S. Patent Publication No. 2019/0012278, published Jan. 10, 2019 (Attorney Docket No. 1242-004US01); and U.S. Pat. No. 10,725,825, issued Jul. 28, 2020 (Attorney Docket No. 1242-048US01), the entire contents of each being incorporated herein by reference.

In the example of FIG. 1, each DPU 17 provides connectivity to network fabric 14 for a different group of storage nodes 12 and/or compute nodes 13 and may be assigned respective IP addresses and provide routing operations for storage nodes 12 and/or compute nodes 13 coupled thereto. DPUs 17 may interface with and utilize network fabric 14 so as to provide any-to-any interconnectivity such that any of storage nodes 12 and/or compute nodes 13 may communicate packet data for a given packet flow to any other of the nodes using any of a number of parallel data paths within the data center 10. In addition, DPUs 17 described herein may provide additional services, such as storage (e.g., integration of solid-state storage devices), security (e.g., encryption), acceleration (e.g., compression), I/O offloading, and the like. In some examples, one or more of DPUs 17 may include storage devices, such as high-speed solid-state drives or rotating hard drives, configured to provide network accessible storage for use by applications executing on the nodes. More details on the data center network architecture and interconnected DPUs illustrated in FIG. 1 are available in U.S. Pat. No. 10,686,729, issued Jun. 16, 2020 (Attorney Docket No. 1242-002US01), the entire contents of which are incorporated herein by reference.

An example architecture of DPUs 17 is described below with respect to FIG. 3. The architecture of each DPU 17 comprises a multiple core processor system that represents a high performance, hyper-converged network, storage, and data processor and input/output hub. The architecture of each DPU 17 is optimized for high performance and high efficiency stream processing. DPUs 17 may process stream information by managing “work units.” In general, a work unit (WU) is a container that is associated with a stream state and used to describe (i.e., point to) data within a stream (stored in memory) along with any associated meta-data and operations to be performed on the data.

Although DPUs 17 are described in FIG. 1 with respect to network fabric 14 of data center 10, in other examples, DPUs may provide full mesh interconnectivity over any packet switched network. For example, the packet switched network may include a local area network (LAN), a wide area network (WAN), or a collection of one or more networks. The packet switched network may have any topology, e.g., flat or multi-tiered, as long as there is network connectivity between the DPUs. The packet switched network may use any technology, including IP over Ethernet as well as other technologies. Irrespective of the type of packet switched network, DPUs may spray individual packets for packet flows between the DPUs and across multiple parallel data paths in the packet switched network and reorder the packets for delivery to the destinations.

Each of DPUs 17 may include a set of host unit interfaces to connect to storage nodes 12 and/or compute nodes 13. The host unit interfaces may be, for example, Peripheral Component Interconnect Express (PCIe) interfaces. In accordance with the techniques of this disclosure, host units (e.g., PCIe controllers) of DPUs 17 may support host unit interfaces configured to operate in rootport (RP) and endpoint (EP) modes for applications having well-defined and open protocols, e.g., non-volatile memory express (NVMe) applications, at end PCIe devices (e.g., PCIe host and PCIe endpoint devices) as well as in upstream (UP) and downstream (DN) virtual switch modes for other applications having unknown protocols at end PCIe devices. More details on the dynamic configuration of host unit interfaces to support RP and EP modes are available in U.S. Patent Publication No. 2020/0073840, published Mar. 5, 2020 (Attorney Docket No. 1242-045US01), the entire contents of which are incorporated herein by reference.

The need for increased storage performance and capacity is rapidly increasing with the growth of big-and-fast data workloads such as artificial intelligence (AI) and analytics. Along with the need for more and faster storage solutions comes more stringent reliability, availability, and security requirements. The emergence of faster, more reliable, and more efficient storage, such as SSDs, along with improved storage protocols, such as NVMe and NVMe over Fabrics, has exposed bottlenecks in the storage stack itself, which is commonly implemented to run on CPUs. By disaggregating storage and decoupling storage from compute, CPU bottlenecks may be reduced. In addition to the storage benefits, disaggregation of compute, storage, and networking components into separate servers enables these components to be pooled for on-demand deployment within data centers. The disaggregation of resources, however, requires that data that used to travel on an internal PCIe system bus now flows across a network, significantly increasing network traffic within data centers.

This disclosure describes techniques for providing a scaled-out transport supported by the interconnected DPUs 17 that operates as a single system bus connection proxy for device-to-device communications between storage nodes 12 and/or compute nodes 13 within data center 10. As one example, this disclosure describes techniques for providing a PCIe proxy for device-to-device communications employing the PCIe standard. In accordance with the techniques described in this disclosure, the PCIe proxy supports disaggregation of resources within the data center by operating as either a virtual PCIe switch or a virtual PCIe device.

In one example, the techniques provide a physical approach to disaggregation in which the interconnected DPUs 17 operate as a locally attached PCIe switch used to connect a PCIe host device (e.g., one of compute nodes 13) to a remotely located PCIe endpoint device (e.g., one of storage nodes 12) using PCIe over network fabric 14. As another example, the techniques provide a logical approach to disaggregation in which at least one of DPUs 17 may operate as a locally attached PCIe device from the perspective of the PCIe host device where the locally attached PCIe device is in effect a virtual device used to abstract one or more PCIe endpoint devices, which may be locally attached or remotely connected to the DPU 17.

The disclosed techniques include adding PCIe proxy logic on top of host units of each of DPUs 17 to expose a PCIe proxy model to storage nodes 12, compute nodes 13, or other end PCIe devices such as NICs or FPGAs. In some examples, the PCIe proxy logic implemented on DPU 17 exposes both local PCIe device functionality and PCIe switch functionality for remotely attached PCIe devices. The PCIe proxy model may be implemented as a physically distributed Ethernet-based switch fabric with PCIe proxy logic at the edge in DPUs 17 and fronting the end PCIe devices (e.g., storage nodes 12 and/or compute nodes 13). In accordance with the disclosed techniques, the interconnected DPUs 17 and the distributed Ethernet-based switch fabric together provide a reliable, low-latency, and scaled-out transport that operates as a PCIe proxy. The scaled-out transport is transparent to the end PCIe devices (e.g., storage nodes 12 and/or compute nodes 13) in that it is logically a locally attached PCIe switch or a locally attached PCIe device from the end PCIe devices' perspectives, as long as the reliability and latency provided by the scaled-out transport are substantially similar to that provided by PCIe. As discussed above, the host unit of each of DPUs 17 may comprise a PCIe controller configured to support both applications having well-defined and open protocols, e.g., NVMe applications, and other applications having unknown protocols at end PCIe devices.

The techniques described in this disclosure include a tunnel transport protocol used by interconnected DPUs 17 of the scaled-out transport to maintain the capabilities of a PCIe switch and make any PCIe devices exposed by DPUs 17 appear to be locally connected to the PCIe host device. The PCIe proxy logic implemented on each of DPUs 17 converts between PCIe and Ethernet in which multiple PCIe transaction layer packets (TLPs) may be included in each Ethernet frame. More specifically, the PCIe proxy logic supports tunneling PCIe over the scaled-out transport using the tunnel transport protocol over IP over Ethernet encapsulation. Tunneling PCIe using the tunnel transport protocol over IP over Ethernet, as opposed to assigning a new Ethertype as in the case of PCIe over Ethernet, enables Layer 3 (L3) routing to occur within the scaled-out transport.
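
As a rough illustration of the conversion described above, the following Python sketch packs multiple PCIe TLPs into a single frame payload behind a small tunnel header carried over IP over Ethernet. The header layout, field sizes, and function names are illustrative assumptions for explanation only; they are not the encoding defined by this disclosure.

```python
import struct

ETH_HDR_LEN = 14          # dst MAC, src MAC, Ethertype (IPv4)
IP_HDR_LEN = 20           # minimal IPv4 header, no options
MTU = 1500                # payload budget of a standard Ethernet frame

def encapsulate_tlps(tlps, tunnel_id, seq):
    """Pack as many PCIe TLPs as fit into one frame payload.

    Returns (frame_payload, remaining_tlps). The 8-byte tunnel header
    (tunnel id, sequence number, TLP count) is an illustrative layout.
    """
    budget = MTU - IP_HDR_LEN - 8       # space left after IP and tunnel headers
    batch, used = [], 0
    for tlp in tlps:
        if used + 2 + len(tlp) > budget:  # 2-byte length prefix per packed TLP
            break
        batch.append(tlp)
        used += 2 + len(tlp)
    tunnel_hdr = struct.pack("!IHH", tunnel_id, seq & 0xFFFF, len(batch))
    body = b"".join(struct.pack("!H", len(t)) + t for t in batch)
    return tunnel_hdr + body, tlps[len(batch):]

# Example: three small memory-write TLPs share a single frame payload.
tlps = [bytes(64), bytes(128), bytes(256)]
payload, rest = encapsulate_tlps(tlps, tunnel_id=7, seq=1)
print(len(payload), len(rest))   # 462 bytes packed, 0 TLPs left over
```

Because the batch rides inside an IP packet rather than under a dedicated Ethertype, intermediate fabric switches can route it at L3 like any other traffic, which is the point the paragraph above makes.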

The PCIe proxy logic implemented on each of DPUs 17 also supports reliable transmission of the encapsulated packets within the scaled-out transport, and maintains PCIe ordering and deadlock prevention solutions. For example, the PCIe proxy logic may require receipt of acknowledgements for transmitted packets and retransmission for all dropped or lost packets. The PCIe proxy logic may further support security within the scaled-out transport by using encrypted and authenticated tunnels. Moreover, the PCIe proxy logic may provide hot plug support with dynamic provisioning and allocation for graceful linking and unlinking of PCIe endpoint devices. The PCIe proxy logic may further enable remote direct memory access (RDMA) from any RDMA capable device connected to network fabric 14 to proxied PCIe endpoint devices.

In some examples, the tunnel transport protocol may comprise a version of a Fabric Control Protocol (FCP) that supports reliable transmission. FCP may be used by the different operational networking components of any of DPUs 17 to facilitate communication of data across network fabric 14. FCP is an end-to-end admission control protocol in which, in one example, a sender explicitly requests a receiver with the intention to transfer a certain number of bytes of payload data. In response, the receiver issues a grant based on its buffer resources, QoS, and/or a measure of fabric congestion. In general, FCP enables spraying of packets of a flow to all paths between a source and a destination node. More details on the FCP are available in U.S. Patent Publication No. 2019/0104206, published Apr. 4, 2019 (Attorney Docket No. 1242-003US01), the entire content of which is incorporated herein by reference.
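
The request/grant exchange described above can be sketched as a toy model; the class and method names are assumptions, and only the idea of granting transfers against available receiver buffer is illustrated, not the actual protocol.

```python
class FcpReceiver:
    """Toy model of FCP-style admission control: grant only what the buffer allows."""

    def __init__(self, buffer_bytes):
        self.free = buffer_bytes

    def request(self, sender, wanted_bytes):
        granted = min(wanted_bytes, self.free)   # grant limited by buffer space (QoS/congestion ignored here)
        self.free -= granted
        return granted

    def release(self, transferred_bytes):
        self.free += transferred_bytes           # buffer drained once payload is consumed

rx = FcpReceiver(buffer_bytes=64 * 1024)
grant = rx.request("dpu-17-1", wanted_bytes=96 * 1024)
print(grant)        # 65536: the sender may now spray this much across all fabric paths
rx.release(grant)   # receiver frees the buffer as the payload is delivered
```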

FIGS. 2A-2B are block diagrams illustrating various example implementations of a PCIe proxy for device-to-device communications, in accordance with the techniques of this disclosure.

FIG. 2A illustrates an example data center 10A that includes a central processing unit (CPU) 24 and three graphics processing units (GPUs) 26A-26C (collectively “GPUs 26”) that are each coupled to a network fabric 14A via one of DPUs 27A-27D (collectively “DPUs 27”). DPUs 27 may operate substantially similar to DPUs 17 from FIG. 1. Each of CPU 24 and GPUs 26 may be included in different compute nodes 13 from FIG. 1.

According to the disclosed techniques, each of DPUs 27 is configured to implement PCIe proxy logic to support four modes for each of its host unit interfaces or ports as endpoint (EP) and rootport (RP) for applications having well-defined and open protocols, e.g., NVMe, and switch upstream (UP) and switch downstream (DN) for unknown application protocols. In the example of FIG. 2A, each of GPUs 26A, 26B, 26C is coupled to a respective one of DPUs 27A, 27B, 27C via a host unit interface 25A, 25B, 25C operating in a DN mode. CPU 24 is coupled to DPU 27D via a host unit interface 25D operating in an UP mode.
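
The four host unit interface modes and the port layout of FIG. 2A can be summarized in a small configuration sketch; the mode names follow the description above, while the dictionary keys are illustrative identifiers rather than anything defined in this disclosure.

```python
from enum import Enum

class PortMode(Enum):
    EP = "endpoint"           # known protocols (e.g., NVMe); DPU terminates the PCIe tree
    RP = "rootport"           # known protocols; DPU acts as root toward a local SSD
    UP = "switch upstream"    # unknown protocols; faces the PCIe host device
    DN = "switch downstream"  # unknown protocols; faces a PCIe endpoint device

# Port layout of FIG. 2A expressed as a plain mapping (keys are illustrative only).
proxy_23_ports = {
    "dpu27d/hu25d": PortMode.UP,   # toward CPU 24
    "dpu27a/hu25a": PortMode.DN,   # toward GPU 26A
    "dpu27b/hu25b": PortMode.DN,   # toward GPU 26B
    "dpu27c/hu25c": PortMode.DN,   # toward GPU 26C
}
print(sum(1 for m in proxy_23_ports.values() if m is PortMode.DN))  # 3 downstream ports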

The PCIe proxy logic and supported host unit interface modes enable the DPUs 27 to operate as PCIe proxy 23 positioned between CPU 24 and GPUs 26. In accordance with the techniques of this disclosure, DPUs 27 and PCIe proxy 23 together provide a reliable, low-latency, and scaled-out transport that is transparent to the end PCIe devices, i.e., CPU 24 and GPUs 26, as long as the reliability and latency provided by the scaled-out transport are substantially similar to that provided by PCIe. In the example of FIG. 2A, PCIe proxy 23 operates as a logical PCIe switch from the perspective of CPU 24 and GPUs 26 such that CPU 24 and GPUs 26 each appear to be locally attached to a PCIe switch. In the example of FIG. 2A, PCIe proxy 23 is logically a 4-port PCIe switch in which CPU 24 communicates with three GPUs 26. In other examples where a CPU communicates directly with a GPU using a dedicated PCIe port, a PCIe proxy operates as a logical 2-port PCIe switch.

The techniques described in this disclosure enable disaggregation of storage nodes 12, compute nodes 13, or other PCIe endpoint devices connected via a PCIe proxy. For example, as illustrated in FIG. 2A, PCIe proxy 23 may enable decoupling of the conventional static allocation of GPUs to specific CPUs. As shown in FIG. 2A, GPUs 26 may be housed in a single location as a pool 28 anywhere within data center 10. Using the PCIe proxy logic implemented on DPUs 27, pool 28 of GPUs 26 may be shared across multiple compute nodes 13 and/or shared across multiple customers 11 (FIG. 1). In the example of FIG. 2A, any of GPUs 26 may be allocated and provisioned from pool 28 to CPU 24. In this way, the PCIe proxy logic implemented on DPUs 27 decouples the typical static allocation of GPUs to specific CPUs, and instead enables CPU 24 to utilize any of GPUs 26 within the remotely located pool 28 as though GPUs 26 were locally attached to CPU 24.

The ability to disaggregate GPUs from specific CPUs further enables data centers to be built with fewer resources. In this way, each compute node 13 within a data center does not need to include all the necessary resources as locally attached devices. For example, GPUs are expensive and may not be fully utilized by one or more locally attached CPUs. The PCIe proxy logic implemented on DPUs 27, as described in this disclosure, enables GPUs 26 to be shared as virtualized GPUs between a plurality of remotely located CPUs within data center 10A. In this way, none of GPUs 26 may be statically assigned to CPU 24. Instead, DPU 27A may implement a virtualized GPU as an abstraction of remotely connected GPU 26A, for example, and the virtualized GPU may be allocated and provisioned to CPU 24. In some examples, another virtual GPU as an abstracted version of GPU 26A may be allocated and provisioned to another CPU (not shown) in data center 10A.

In the example of FIG. 2A, upon allocating and provisioning a virtualized version of GPU 26A from pool 28 to CPU 24, DPU 27D receives PCIe TLPs from CPU 24 via host unit interface 25D operating in the UP mode. DPU 27D, implementing the PCIe proxy logic, determines that the PCIe TLPs are destined for remotely located GPU 26A, and converts from PCIe to Ethernet by packing the PCIe TLPs into Ethernet frames. DPU 27D then tunnels the PCIe packets over PCIe proxy 23 to DPU 27A using a tunnel transport protocol over IP over Ethernet encapsulation. DPU 27A receives the encapsulated packets from DPU 27D. DPU 27A, implementing the PCIe proxy logic, decapsulates the PCIe TLPs from the Ethernet frames to convert from Ethernet back to PCIe. DPU 27A then forwards the PCIe TLPs to GPU 26A for processing via host unit interface 25A operating in the DN mode. In accordance with the disclosed techniques, GPU 26A and the processing performed by GPU 26A appear to be locally attached to a PCIe switch from the perspective of CPU 24.
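
The end-to-end flow just described (UP-mode ingress at DPU 27D, tunneling across PCIe proxy 23, DN-mode egress at DPU 27A) can be outlined in a short runnable sketch; the function names and the trivial routing callback are assumptions, and the actual encapsulation is elided.

```python
def up_ingress(tlps, route):
    """DPU 27D side: group TLPs by destination DPU and hand each group to a tunnel."""
    batches = {}
    for tlp in tlps:
        batches.setdefault(route(tlp), []).append(tlp)
    # Each batch would be packed into Ethernet frames and tunneled (encapsulation elided).
    return [{"tunnel_to": dest, "tlps": batch} for dest, batch in batches.items()]

def dn_egress(frame):
    """DPU 27A side: unpack the tunneled batch and replay the TLPs on the local DN port."""
    return frame["tlps"]

# Toy routing: every TLP in this example targets GPU 26A behind DPU 27A.
frames = up_ingress([b"tlp-cfg", b"tlp-memwr"], route=lambda tlp: "dpu-27a")
for frame in frames:
    for tlp in dn_egress(frame):
        print(frame["tunnel_to"], len(tlp))   # host unit 25A would drive the TLP to GPU 26A
```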

FIG. 2B illustrates an example data center 10B that includes multiple different application processors (e.g., CPU 30 and GPUs 34, 38, 40) and storage devices (e.g., SSDs 32, 36) that are each coupled to a network fabric 14B via one of DPUs 37A-37E (collectively “DPUs 37”). DPUs 37 may operate substantially similar to DPUs 17 from FIG. 1. Each of CPU 30 and GPUs 34, 38, 40 may be included in different compute nodes 13 from FIG. 1, and each of SSDs 32, 36 may be included in different storage nodes 12 from FIG. 1.

Similar to FIG. 2A, each of DPUs 37 is configured to implement PCIe proxy logic to support the four modes (i.e., EP, RP, UP, DN) for each of its host unit interfaces. The supported host unit interface modes enable the DPUs 37 to operate as PCIe proxy 33 positioned between the application processors and storage nodes. In the example of FIG. 2B, CPU 30 is coupled to DPU 37A via a first host unit interface 35A having multiple functions including a first function operating in an UP mode for unknown application protocols and a second function operating in an EP mode for known application protocols, e.g., NVMe. Each of SSDs 32, 36 is coupled to a respective one of DPUs 37B, 37D via a host unit interface 35B, 35D operating in an RP mode. Each of GPUs 34, 38, 40 is coupled to a respective one of DPUs 37C, 37D, 37E via a host unit interface 35C, 35E, 35F operating in a DN mode.

As shown in FIG. 2B, at least some of DPUs 37 may include multiple host unit controllers to enable multiple host unit interfaces operating in different modes and/or multiple functions of a single host unit interface operating in different modes to co-exist on the same DPU and belong to the same PCIe proxy 33. For example, for PCIe proxy 33, DPU 37A includes a single host unit interface 35A having a first function operating in the UP mode and a second function operating in the EP mode. DPU 37D includes first host unit interface 35D operating in the RP mode and second host unit interface 35E operating in the DN mode. In some other examples, a single DPU may have both an UP function and a DN function that co-exist and belong to the same PCIe proxy as either separate host unit interfaces or within a single host unit interface. In addition, one or more of DPUs 37 may expose at least one host unit interface for local PCIe functions as well as for PCIe switch functions for remotely attached PCIe devices.

In examples where CPU 30 executes an application that uses NVMe as a storage protocol to transfer data between CPU 30 and SSDs 32, 36, DPU 37A exposes host unit interface 35A as an NVMe EP to CPU 30. Host unit interface 35A may include two PCIe functions or branches: one function operating in switch upstream (UP) mode used to physically connect CPU 30 to GPUs 34, 38, 40 attached to DN mode ports using PCIe over network fabric 14B, and the other function operating in EP mode used to logically connect CPU 30 to a virtual PCIe device (e.g., a virtual SSD) implemented by DPU 37A. CPU 30 performs PCIe enumeration on host unit interface 35A of DPU 37A and discovers the UP function and the EP function. For the UP function, CPU 30 sees PCIe proxy 33 as a PCIe switch having an upstream port and one or more downstream ports attached to PCIe endpoint devices, such as GPUs 34, 38, 40. For the EP function, CPU 30 only sees that the locally attached device provides EP functionality but does not know where the EP is physically implemented, i.e., PCIe proxy 33 is invisible and appears to CPU 30 as a locally attached PCIe device. According to the disclosed techniques, DPUs 37 and network fabric 14B comprising PCIe proxy 33 are configured to disaggregate or extend the EP functionality across network fabric 14B to one or more remote SSDs, such as SSDs 32, 36. For example, the EP functionality of host unit interface 35A may be extended toward remote SSD 32 by binding together the independent RP/EP trees of CPU 30 to DPU 37A and DPU 37B to SSD 32.
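
The dual-function view that CPU 30 obtains when enumerating host unit interface 35A might be pictured as below; the function numbers and description strings are hypothetical, included only to make the UP-versus-EP distinction concrete.

```python
from dataclasses import dataclass

@dataclass
class PcieFunction:
    number: int
    mode: str          # "UP" or "EP"
    description: str

# Hypothetical enumeration result for host unit interface 35A as seen by CPU 30.
hu35a_functions = [
    PcieFunction(0, "UP", "upstream port of a virtual PCIe switch (GPUs 34, 38, 40 behind DN ports)"),
    PcieFunction(1, "EP", "NVMe endpoint backed by a virtual SSD abstracting SSDs 32 and 36"),
]

for fn in hu35a_functions:
    print(f"function {fn.number}: {fn.mode} - {fn.description}")
```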

In the case where CPU 30 selects the EP function of host unit interface 35A of DPU 37A, PCIe proxy 33 operates as the virtual SSD, which may be an abstraction of one or more physical SSDs that are locally or remotely attached to DPU 37A (e.g., SSDs 32, 36). At the other side of PCIe proxy 33, DPU 37B exposes host unit interface 35B as an NVMe RP to SSD 32 and DPU 37D exposes host unit interface 35D as an NVMe RP to SSD 36.

The NVMe protocol, being a well-known and open protocol, enables DPUs 37 to terminate the RP/EP tree with CPU 30 and intercept the traffic between CPU 30 and one or more of SSDs 32, 36 to provide application-level features. In this way, CPU 30 accesses each of SSDs 32, 36 via DPUs 37 using EP and RP functionality. In some examples, the EP and RP functionality may also be used by DPUs 37 to enable disaggregation of other types of PCIe endpoint devices, such as NICs or FPGAs, that are either remotely or locally attached to the DPUs. In other examples, the UP and DN functionality may be used by DPUs 37 to enable disaggregation of SSDs by not terminating any PCIe trees and staying at a PCIe transport level for the scale out, but this comes at the cost of losing the application-level features.

In other examples where CPU 30 executes an application that uses an unknown application protocol to transfer data between CPU 30 and GPUs 34, 38, 40, DPUs 37 cannot use the EP/RP modes to intercept traffic in order to achieve scale out. In the case where CPU 30 selects the UP function of host unit interface 35A of DPU 37A to access remotely attached GPUs 34, 38, 40, PCIe proxy 33 operates as a virtual PCIe switch between CPU 30 and GPUs 34, 38, 40. In accordance with the techniques of this disclosure, when operating in the UP/DN modes, DPUs 37 are configured to stay at a PCIe transport level for the scale out. In this case, DPUs 37 and PCIe proxy 33 provide a reliable, low-latency and scaled-out transport between end PCIe devices (e.g., CPU 30 and GPUs 34, 38, 40), and ensure that the transport is logically a PCIe switch from the perspective of the end PCIe devices. At the other side of PCIe proxy 33, DPU 37C exposes host unit interface 35C as a DN port to GPU 34, DPU 37D exposes host unit interface 35E as a DN port to GPU 38, and DPU 37E exposes host unit interface 35F as a DN port to GPU 40. In this way, CPU 30 and GPUs 34, 38, 40 communicate with each other via the PCIe proxy logic implemented on each of DPUs 37. In some examples, the UP and DN functionality may also be used by DPUs 37 to enable disaggregation of other types of PCIe endpoint devices, such as NICs or FPGAs, that are either remotely or locally attached to the DPUs.

FIG. 3 is a block diagram illustrating a system 58 including an example DPU 60 communicatively coupled to an example application processor (i.e., CPU 90) via a PCIe connection. As illustrated in FIG. 3, DPU 60 includes a run-to-completion data plane operating system (OS) 62 configured to process work units. Each of DPU 60 and CPU 90 generally represents a hardware chip implemented in digital logic circuitry. DPU 60 and CPU 90 may be hosted on the same or different computing devices. DPU 60 may operate substantially similar to any of DPUs 17, 27, or 37 from FIGS. 1-2B. CPU 90 may operate substantially similar to any of CPUs 24 or 30 from FIGS. 2A-2B. In the illustrated example of FIG. 3, system 58 also includes example storage devices (i.e., SSDs 88) communicatively coupled to DPU 60 via a PCIe connection. SSDs 88 may operate substantially similar to any of SSDs 32, 36 from FIG. 2B.

DPU 60 is a highly programmable I/O processor with a plurality of processing cores (as discussed below, e.g., with respect to FIG. 5). In the illustrated example of FIG. 3, DPU 60 includes a network interface (e.g., an Ethernet interface) to connect directly to a network, and a plurality of host interfaces (e.g., PCIe interfaces) to connect directly to one or more application processors (e.g., CPU 90) and one or more storage devices (e.g., SSDs 88). DPU 60 also includes run-to-completion data plane OS 62 executing on two or more of the plurality of processing cores. Data plane OS 62 provides data plane 64 as an execution environment for a run-to-completion software function invoked on data plane OS 62 to process a work unit. The work unit is associated with one or more stream data units (e.g., packets of a packet flow), and specifies the software function for processing the stream data units and one processing core of the plurality of processing cores for executing the software function.

The software function invoked to process the work unit may be one of a plurality of software functions for processing stream data included in a library 70 provided by data plane OS 62. In the illustrated example, library 70 includes network functions 72, storage functions 74, security functions 76, and analytics functions 78. Network functions 72 may, for example, include network I/O data processing functions related to Ethernet, network overlays, networking protocols, encryption, and firewalls. Storage functions 74 may, for example, include storage I/O data processing functions related to NVMe (non-volatile memory express), compression, encryption, replication, erasure coding, and pooling. Security functions 76 may, for example, include security data processing functions related to encryption, regular expression processing, and hash processing. Analytics functions 78 may, for example, include analytical data processing functions related to a customizable pipeline of data transformations.
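
A work unit carries just enough information for data plane OS 62 to dispatch it: which library function to run, on which core, and where the stream data lives. The following sketch shows one way to model that association; the registry keys, field names, and handlers are assumptions, not the actual work unit format.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class WorkUnit:
    function: str     # names the library function that processes the stream data units
    core: int         # the processing core chosen to execute that function
    stream_ptr: int   # points at the stream data (and its state) in memory

def pcie_proxy(wu: WorkUnit) -> None:
    """Stand-in for a network function 72: would convert TLPs and tunnel them."""
    pass

# Illustrative registry standing in for library 70; the keys and handlers are assumptions.
LIBRARY: Dict[str, Callable[[WorkUnit], None]] = {
    "net.pcie_proxy": pcie_proxy,          # network function 72 (PCIe proxy logic)
    "storage.nvme": lambda wu: None,       # storage function 74 (NVMe handling)
}

wu = WorkUnit(function="net.pcie_proxy", core=3, stream_ptr=0x1000)
LIBRARY[wu.function](wu)                   # data plane OS 62 resolves and invokes the handler
```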

In accordance with the techniques of this disclosure, network functions 72 include PCIe proxy logic used to facilitate device-to-device communications employing the PCIe standard over an Ethernet-based switch fabric. More specifically, the PCIe proxy logic converts between PCIe and Ethernet, and tunnels PCIe over the Ethernet-based switch fabric using a tunnel transport protocol over IP over Ethernet encapsulation. The PCIe proxy logic uses the tunnel transport protocol to maintain PCIe data processing solutions (e.g., reliability, ordering, deadlock prevention, and hot plug support solutions) within the PCIe proxy logical tunnel. In addition, the PCIe proxy logic supports four modes for each of its host unit interfaces (e.g., PCIe interfaces), including EP and RP for applications having well-defined and open protocols, e.g., NVMe, and switch UP and DN for unknown application protocols. In this way, DPU 60, along with other DPUs interconnected by the Ethernet-based switch fabric, supports a scaled-out transport that is transparent to the end PCIe devices (e.g., CPU 90 and SSDs 88) and operates as either a virtual PCIe switch or a virtual PCIe device.

In general, data plane OS 62 is a low level, run-to-completion operating system running on bare metal of DPU 60 that runs hardware threads for data processing and manages work units. Data plane OS 62 includes the logic of a queue manager to manage work unit interfaces, enqueue and dequeue work units from queues, and invoke a software function specified by a work unit on a processing core specified by the work unit. In the run-to-completion programming model, data plane OS 62 is configured to dequeue a work unit from a queue, process the work unit on the processing core, and return the results of processing the work unit to the queues.

DPU 60 also includes a multi-tasking control plane operating system executing on one or more processing cores of the plurality of processing cores. In some examples, the multi-tasking control plane operating system may comprise Linux, Unix, or a special-purpose operating system. In some examples, as illustrated in FIG. 3, data plane OS 62 provides a control plane 66 including a control plane software stack executing on data plane OS 62. As illustrated, the control plane software stack includes a hypervisor 80, a multi-tasking control plane OS 82 executing on hypervisor 80, and one or more control plane service agents 84 executing on control plane OS 82. Hypervisor 80 may operate to isolate control plane OS 82 from the work unit and data processing performed on data plane OS 62. Control plane service agents 84 executing on control plane OS 82 comprise application level software configured to perform set up and tear down of software structures to support work unit processing performed by the software function executing on data plane OS 62. In the example of data packet processing, control plane service agents 84 are configured to set up the packet flow for data packet processing by the software function on data plane OS 62, and tear down the packet flow once the packet processing is complete. In this way, DPU 60 comprises a highly programmable processor that can run application level processing while leveraging the underlying work unit data structure for highly parallelized stream processing.

In another example, instead of running on top of data plane OS 62, the multi-tasking control plane operating system may run on one or more independent processing cores that are dedicated to the control plane operating system and different than the processing cores executing data plane OS 62. In this example, if an independent processing core is dedicated to the control plane operating system at the hardware level, a hypervisor may not be included in the control plane software stack. Instead, the control plane software stack running on the independent processing core may include the multi-tasking control plane operating system and one or more control plane service agents executing on the control plane operating system.

CPU 90 is an application processor with one or more processing cores optimized for computing-intensive tasks. In the illustrated example of FIG. 3, CPU 90 includes a plurality of host interfaces (e.g., PCIe interfaces) to connect directly to DPU 60. CPU 90 includes a hypervisor/OS 92 that supports one or more service agents 96 and one or more drivers 97. As illustrated in FIG. 3, CPU 90 may also include a virtual machine (VM) OS 94 executing on top of hypervisor/OS 92 that supports one or more drivers 98. Application level software, such as agents 96 or drivers 97 executing on OS 92 or drivers 98 executing on VM OS 94, of CPU 90 may determine which data processing tasks to offload from CPU 90 to DPU 60. In accordance with the techniques of this disclosure, CPU 90 may send PCIe TLPs destined for either a remotely connected or locally attached PCIe endpoint device, e.g., a GPU, SSD, NIC, or FPGA, to DPU 60 using physical functions (PFs) and/or virtual functions (VFs) of PCIe links for further transmission over the scaled-out transport via the PCIe proxy logic implemented on DPU 60. Similarly, CPU 90 may send PCIe TLPs to a VF of DPU 60 on behalf of VM OS 94. From the perspective of CPU 90, DPU 60 appears to be either a locally attached PCIe switch or the locally attached PCIe device.

In the illustrated example of FIG. 3, system 58 also includes a controller 100 in communication with both DPU 60 and CPU 90 via a control application programming interface (API). Controller 100 may provide a high-level controller for configuring and managing application level software executing on a control plane operating system of each of DPU 60 and CPU 90. For example, controller 100 may configure and manage which data processing tasks are to be offloaded from CPU 90 to DPU 60. In some examples, controller 100 may comprise a software-defined networking (SDN) controller, which may operate substantially similar to controller 21 of FIG. 1. In some examples, controller 100 may operate in response to configuration input received from a network administrator via an orchestration API.

FIG. 4 is a block diagram illustrating an example data processing unit 130, in accordance with the techniques of this disclosure. DPU 130 generally represents a hardware chip implemented in digital logic circuitry. DPU 130 may operate substantially similar to any of the DPUs 17 of FIG. 1, DPUs 27 of FIG. 2A, DPUs 37 of FIG. 2B, or DPU 60 of FIG. 3. Thus, DPU 130 may be communicatively coupled to one or more storage nodes, compute nodes, CPUs, GPUs, FPGAs, SSDs, network devices, server devices, storage devices, network fabrics, or the like, e.g., via a network interface such as Ethernet (wired or wireless), a system bus connection interface such as PCIe, or other such communication media.

In the illustrated example of FIG. 4, DPU 130 includes a plurality of programmable processing cores 140A-140N (“cores 140”). DPU 130 also includes a networking unit 142, a plurality of work unit (WU) queues 144, and at least one host unit 146 having a mode unit 147. Although not illustrated in FIG. 4, each of cores 140, networking unit 142, WU queues 144, and host unit 146 are communicatively coupled to each other. In accordance with the techniques of this disclosure, PCIe proxy logic 148 and transport protocol tunneling unit 150 may be implemented on DPU 130 to provide a scaled-out transport that operates as a PCIe proxy for end PCIe devices (e.g., CPUs, GPUs, other compute nodes, SSDs, other storage nodes, NICs, and/or FPGAs).

In this example, DPU 130 represents a high performance, hyper-converged network, storage, and data processor and input/output hub. For example, networking unit 142 may be configured to send and receive stream data units with one or more external devices, e.g., network devices. Networking unit 142 may perform network interface card functionality, packet switching, and the like, and may use large forwarding tables and offer programmability. Networking unit 142 may expose network interface (e.g., Ethernet) ports for connectivity to a network, such as network fabric 14 of FIG. 1. Host unit 146 may expose one or more host unit interface (e.g., PCIe) ports to send and receive stream data units with end PCIe devices (e.g., PCIe host and PCIe endpoint devices). DPU 130 may further include one or more high bandwidth interfaces for connectivity to off-chip external memory (not illustrated in FIG. 4).

At least one of WU queues 144 may be associated with each of cores 140 and configured to store a plurality of work units enqueued for processing on the respective one of the cores 140. In some examples, each of cores 140 may have a dedicated one of WU queues 144 that stores work units for processing by the respective one of cores 140. In other examples, each of cores 140 may have two or more dedicated WU queues 144 that store work units of different priorities for processing by the respective one of cores 140.

Cores 140 may comprise one or more of MIPS (microprocessor without interlocked pipeline stages) cores, ARM (advanced RISC (reduced instruction set computing) machine) cores, PowerPC (performance optimization with enhanced RISC-performance computing) cores, RISC-V (RISC five) cores, or complex instruction set computing (CISC or x86) cores. Each of cores 140 may be programmed to process one or more events or activities related to a given packet flow such as, for example, a networking packet flow, a storage packet flow, a security packet flow, or an analytics packet flow. Each of cores 140 may be programmable using a high-level programming language, e.g., C, C++, or the like.

In some examples, the plurality of cores 140 executes instructions for processing a plurality of events related to each data packet of a packet flow, received by networking unit 142 or host unit 146, in a sequential manner in accordance with one or more work units associated with the data packets. Work units are sets of data exchanged between cores 140 and networking unit 142 or host unit 146 where each work unit may represent one or more of the events related to a given data packet. More specifically, a work unit is associated with one or more data packets, and specifies a software function for processing the data packets and further specifies one of cores 140 for executing the software function.

In general, to process a work unit, the one of cores 140 specified by the work unit is configured to retrieve the data packets associated with the work unit from a memory, and execute the software function specified by the work unit to process the data packets. For example, receiving a work unit is signaled by receiving a message in a work unit receive queue (e.g., one of WU queues 144). Each of WU queues 144 is associated with one of cores 140 and is addressable in the header of the work unit message. Upon receipt of the work unit message from networking unit 142, host unit 146, or another one of cores 140, the work unit is enqueued in the one of WU queues 144 associated with the one of cores 140 specified by the work unit. The work unit is later dequeued from the one of WU queues 144 and delivered to the one of cores 140. The software function specified by the work unit is then invoked on the one of cores 140 for processing the work unit. The one of cores 140 then outputs the corresponding results of processing the work unit back to WU queues 144.
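
The enqueue/dequeue life cycle described above can be condensed into a short run-to-completion sketch, assuming one software queue per core and a handler table standing in for the invoked software functions; all names here are illustrative.

```python
from collections import deque, namedtuple

WorkUnit = namedtuple("WorkUnit", ["core", "function", "data"])
NUM_CORES = 4
wu_queues = [deque() for _ in range(NUM_CORES)]    # one of WU queues 144 per core

def receive_wu_message(wu):
    wu_queues[wu.core].append(wu)                  # target queue is addressed in the WU header

def run_to_completion(core_id, handlers):
    """Dequeue a work unit, run its handler to completion, emit any follow-on WU."""
    while wu_queues[core_id]:
        wu = wu_queues[core_id].popleft()
        next_wu = handlers[wu.function](wu)        # no preemption within a work unit
        if next_wu is not None:
            receive_wu_message(next_wu)            # results return to the queues as new work

receive_wu_message(WorkUnit(core=0, function="echo", data=b"tlp"))
run_to_completion(0, {"echo": lambda wu: None})
```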

More details on the components and functionality of DPUs are described in U.S. Patent Publication No. 2019/0012278, published Jan. 10, 2019 (Attorney Docket No. 1242-004US01) and U.S. Pat. No. 10,725,825, issued Jul. 28, 2020 (Attorney Docket No. 1242-048US01), the entire contents of each being incorporated herein by reference.

In order to support the PCIe proxy described herein, PCIe proxy logic 148 is added on top of host unit 146 to expose a PCIe proxy model to the end PCIe devices communicatively coupled to DPU 130 via host unit 146. In some examples, PCIe proxy logic 148 may comprise software functionality executed by one or more of processing cores 140. In other examples, PCIe proxy logic 148 may comprise hardware (i.e., logic circuits) implemented on DPU 130. In accordance with the disclosed techniques, the PCIe proxy model is implemented as a reliable, low-latency, and scaled-out transport that is transparent to the end PCIe devices.

Host unit 146 may include one or more PCIe controllers (not shown in FIG. 4) configured to enable PCIe interfaces or ports of host unit 146 to operate in different modes. More specifically, PCIe proxy logic 148 supports four modes for each PCIe interface or port: endpoint (EP) and rootport (RP) modes toward NVMe storage nodes (e.g., SSDs) or other end PCIe devices using known application protocols, and switch upstream (UP) and switch downstream (DN) modes toward other end PCIe devices (e.g., devices including CPUs and GPUs) using unknown application protocols. PCIe proxy logic 148 may set the operational modes for the PCIe interfaces or ports via mode unit 147 of host unit 146.

PCIe proxy logic 148 uses transport protocol tunneling unit 150 to maintain the capabilities of a PCIe switch and make any PCIe devices exposed by DPU 130 appear to be locally attached to the PCIe host device. Transport protocol tunneling unit 150 includes encapsulation unit 152 configured to convert between PCIe and Ethernet by packing multiple PCIe TLPs into each Ethernet frame, and applying a tunnel transport protocol over IP over Ethernet encapsulation for tunneling PCIe over the scaled-out transport. Tunneling PCIe using the tunnel transport protocol over IP over Ethernet, as opposed to assigning a new Ethertype as in the case of PCIe over Ethernet, enables L3 routing to occur within the scaled-out transport. Transport protocol tunneling unit 150 also includes reliable transmission unit 154 configured to support reliable transmission of the encapsulated packets within the scaled-out transport, and maintains PCIe ordering and deadlock prevention solutions. Transport protocol tunneling unit 150 includes encryption unit 156 to support security within the transport by using encrypted tunnels. Moreover, transport protocol tunneling unit 150 includes hot plug support unit 158 configured to perform dynamic provisioning and allocation for graceful linking and unlinking of PCIe endpoint devices. Transport protocol tunneling unit 150 may further enable RDMA from any RDMA-capable device connected to the network fabric to proxied PCIe endpoint devices. Finally, transport protocol tunneling unit 150 implements network congestion avoidance and control mechanisms.
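
A minimal sketch of the encapsulation step follows, assuming an illustrative tunnel header layout and a per-TLP length prefix so the destination DPU can recover TLP boundaries; the names, field widths, and layout (tunnel_hdr, src_gid, dst_gid, pack_tlps) are assumptions for illustration, not the actual wire format, and the outer Ethernet and IP headers are assumed to be added by the networking unit.

    /* Sketch of packing PCIe TLPs into a tunnel payload carried over IP over
     * Ethernet; header layout and names are illustrative assumptions only. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct tunnel_hdr {        /* assumed per-tunnel header following the IP header */
        uint16_t src_gid;      /* source global identifier                          */
        uint16_t dst_gid;      /* destination global identifier                     */
        uint32_t seq;          /* sequence number used for reliable delivery        */
    };

    /* Append as many whole TLPs as fit into the frame payload; each TLP is
     * preceded by a 16-bit length so the destination DPU can recover TLP
     * boundaries when decapsulating. Returns the number of TLPs packed. */
    size_t pack_tlps(uint8_t *frame, size_t cap,
                     const uint8_t *const *tlps, const uint16_t *tlp_len, size_t n_tlps,
                     const struct tunnel_hdr *th)
    {
        size_t off, i;
        if (cap < sizeof(*th))
            return 0;
        memcpy(frame, th, sizeof(*th));           /* tunnel header (assumed layout) */
        off = sizeof(*th);
        for (i = 0; i < n_tlps; i++) {
            size_t need = 2u + tlp_len[i];
            if (off + need > cap)
                break;                            /* next TLP goes in a later frame */
            frame[off]     = (uint8_t)(tlp_len[i] >> 8);
            frame[off + 1] = (uint8_t)(tlp_len[i] & 0xff);
            memcpy(frame + off + 2, tlps[i], tlp_len[i]);
            off += need;
        }
        return i;                                 /* TLPs packed into this frame    */
    }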

In one example, when PCIe traffic is received by a PCIe interface or port of host unit 146 operating a Switch Upstream or Downstream Function, the traffic may be split into a Slow Path and a Fast Path. For example, valid PCIe configuration TLPs received by a Switch Upstream Function go through the Slow Path, where processors are involved to implement PCIe features in software. Other Switch TLPs may go through the Fast Path in hardware for performance. In another example, when PCIe traffic is received by a PCIe interface or port of host unit 146 operating an RP or EP Function, there is no change to the RP traffic flow, and the EP traffic flow only changes for processing of configuration TLPs. For configuration TLPs of an EP traffic flow, per-PCIe-controller configuration of host unit 146 determines whether all such TLPs are handled by the Fast Path or the Slow Path.
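
The split can be pictured as a simple classification step, as in the following sketch; the TLP type and path enumerations and the function name classify_switch_upstream_tlp are assumptions for illustration, not actual host unit interfaces.

    /* Sketch of Slow Path vs. Fast Path steering for TLPs arriving on a
     * Switch Upstream Function; all names are illustrative assumptions. */
    #include <stdbool.h>

    enum tlp_type { TLP_CFG_RD, TLP_CFG_WR, TLP_MEM_RD, TLP_MEM_WR, TLP_CPL, TLP_MSG };
    enum tlp_path { PATH_SLOW /* software on cores 140 */, PATH_FAST /* host unit hardware */ };

    struct tlp {
        enum tlp_type type;
        bool          valid;   /* header/format checks passed */
    };

    /* Configuration TLPs go to the Slow Path so software can implement PCIe
     * features and program the routing database; other switch TLPs stay on
     * the Fast Path in hardware for performance. */
    enum tlp_path classify_switch_upstream_tlp(const struct tlp *t)
    {
        if (t->valid && (t->type == TLP_CFG_RD || t->type == TLP_CFG_WR))
            return PATH_SLOW;
        return PATH_FAST;
    }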

For Slow Path processing in software, the Configuration Request TLPs received by a PCIe Switch Upstream Function are directly sent over to one of cores 140. By processing the Configuration Request TLPs, software executed by cores 140 sets up the PCIe routing database in both the local DPU 130 housing the PCIe Switch Upstream Port on host unit 146 and all remote DPUs housing PCIe Switch Downstream Ports, to enable DPU 130 to perform Fast Path hardware routing of TLPs. Cores 140 may also provide PCIe capabilities that are not implemented by the hardware PCIe controllers by composing completion TLPs to the PCIe host device from which the Configuration Request TLPs were received.

For Fast Path hardware routing, memory request TLPs, completion TLPs, and ID-routed message TLPs received by a PCIe Switch Function are routed by host unit 146 hardware by looking up the routing database that was set up by cores 140 as described above. For each Switch Upstream Port, software sets up a lookup database to support a maximum-scale PCIe Switch, e.g., 1 Upstream Port and 32 Downstream Ports. This database is duplicated to every DPU that houses one or more Switch Ports of the PCIe proxy. The database supports both memory address-based routing and ID-based routing. The routing result is a flow index. The flow index may be mapped to a global transport index for transport across a distributed Ethernet-based switch fabric. Host unit 146 hardware pushes the TLPs into a queue for the flow index. A per-flow maximum transmission unit (MTU) in bytes is programmed. For every MTU bytes of TLPs accumulated, host unit 146 hardware sends out a WU carrying information for the accumulated bytes. The flow queues are also used by software to inject TLPs or special control packets that are for communication between DPUs. One typical use is to communicate an outcome of the Slow Path processing.
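
A sketch of the per-flow accumulation described above follows; the structure and callback names (flow_queue, send_wu_fn, flow_push_tlp) are assumptions used only to illustrate the MTU-based batching of TLP bytes into WUs.

    /* Sketch of per-flow TLP accumulation: a routed TLP's bytes are queued
     * against its flow, and a WU is emitted for every full MTU accumulated.
     * All names are illustrative assumptions; assumes mtu_bytes > 0. */
    #include <stdint.h>

    struct flow_queue {
        uint32_t flow_index;      /* result of the routing lookup                      */
        uint32_t transport_index; /* global transport index mapped from the flow index */
        uint32_t mtu_bytes;       /* programmed per-flow MTU, in bytes                 */
        uint32_t accumulated;     /* TLP bytes queued since the last WU was sent       */
    };

    /* Hypothetical callback that sends a WU describing 'bytes' of queued TLPs. */
    typedef void (*send_wu_fn)(uint32_t transport_index, uint32_t bytes);

    void flow_push_tlp(struct flow_queue *fq, uint32_t tlp_bytes, send_wu_fn send_wu)
    {
        fq->accumulated += tlp_bytes;
        while (fq->accumulated >= fq->mtu_bytes) {
            send_wu(fq->transport_index, fq->mtu_bytes);
            fq->accumulated -= fq->mtu_bytes;
        }
    }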

The networking unit 142 that processes the WUs frees up storage once the TLPs are read for transmission, which allows host unit 146 hardware to push more TLPs (and control packets) into the queue. That engine is also responsible for: resolving the Downstream Port in the destination DPU; encapsulating the TLPs and control packets into Ethernet packets; and delivering the encapsulated packets reliably to the destination DPU Downstream Port. The destination DPU does the reverse: decapsulate the Ethernet byte streams; retrieve the original TLPs and control packets; send the TLPs to the attached endpoint PCIe devices; and consume/terminate the control packets in the destination DPU. The destination DPU may be remotely located or may be the same source DPU, e.g., DPU 130, and may even be the same source port in the same source DPU.

The techniques of this disclosure are directed to building a direct hardware path between host unit 146 and networking unit 142, and building a reliable and low-latency transport among networking units belonging to different DPUs. In this way, PCIe TLPs may be delivered across a distributed Ethernet-based switch fabric that operates as a PCIe proxy, i.e., a virtual PCIe device or a virtual PCIe switch. On a source DPU, e.g., DPU 130, one or more of cores 140 execute encapsulation unit 152 to pack TLPs into Ethernet packets, and execute reliable transmission unit 154 to reliably deliver the encapsulated packets to destination DPUs in tunnels. On a destination DPU, the encapsulated packets are unpacked back to the original TLPs and sent to the destination PCIe endpoint devices.

Reliable transmission unit 154 enables DPU 130 to maintain PCIe ordering rules for correctness and deadlock avoidance even when the TLPs are encapsulated and tunneled over Ethernet. As one example, for each source-controller and destination-controller pair in each direction, reliable transmission unit 154 puts TLPs that have ordering requirements into the same tunnel and delivers the TLPs in strict order. In some cases, for a source-destination controller pair in each direction, TLPs delivered in the Switch Fast Path may be put into a single tunnel. As another example, reliable transmission unit 154 puts aside non-posted request TLPs to enable posted request TLPs and completion TLPs to bypass them, in order to avoid deadlock. In some cases, each DPU that sinks non-posted request TLPs has enough buffer to store them aside so as not to prevent later-arriving posted request TLPs or completion TLPs (delivered by the same tunnel) from being delivered to the local PCIe links should the non-posted request TLPs be flow-controlled by the PCIe link.
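
The set-aside behavior can be sketched as follows, assuming an illustrative set-aside buffer depth and hypothetical link callbacks (link_has_np_credit, link_transmit); this is a sketch under those assumptions, not the actual reliable transmission unit implementation.

    /* Sketch of ordering-preserving delivery from a tunnel: posted requests
     * and completions may bypass non-posted requests that are flow-controlled
     * by the local PCIe link. Names and sizes are illustrative assumptions. */
    #include <stdbool.h>
    #include <stdint.h>

    enum tlp_class { TLP_POSTED, TLP_NON_POSTED, TLP_COMPLETION };

    struct rx_tlp { enum tlp_class cls; const uint8_t *data; uint16_t len; };

    #define NP_SET_ASIDE_DEPTH 256u   /* assumed reserved buffer for non-posted TLPs */

    struct link_sink {
        struct rx_tlp np_set_aside[NP_SET_ASIDE_DEPTH];
        uint32_t      np_count;
        bool        (*link_has_np_credit)(void);             /* PCIe link flow control  */
        void        (*link_transmit)(const struct rx_tlp *); /* deliver TLP to the link */
    };

    /* Deliver TLPs from a tunnel in arrival order, but park non-posted
     * requests when the link has no non-posted credit so that later posted
     * requests and completions from the same tunnel are not blocked. */
    void deliver_from_tunnel(struct link_sink *s, const struct rx_tlp *t)
    {
        if (t->cls == TLP_NON_POSTED && !s->link_has_np_credit()) {
            if (s->np_count < NP_SET_ASIDE_DEPTH)
                s->np_set_aside[s->np_count++] = *t;  /* replayed when credit returns */
            return;
        }
        s->link_transmit(t);
    }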

DPU software may inject packets carrying posted request TLPs into the Fast Flow towards networking unit 142. These injected packets are packed the same way as the normal TLPs. DPU hardware itself may also inject special packets to facilitate communication between DPU hardware without software getting involved. The injected packets may include a hardware-injected read acknowledgement (Read Ack) message, a software-injected WU, or a software-injected posted request TLP.

Hot plug support unit 158 enables DPU 130 to support standard PCIe hot-plug, async removal, and downstream port containment (DPC) even across the distributed Ethernet-based switch fabric. In a PCIe Downstream port (e.g., an RP or DN mode Port), PCIe has defined capabilities to support both orderly addition/removal (i.e., standard hot-plug or sync hot-plug) and async removal of PCIe devices. Standard PCIe hot-plug (including hot-add and hot-removal) is performed in a lock-step manner with the operating system through a well-defined sequence of user actions and system management facilities. Async removal refers to the removal of an adapter or disabling a Downstream Port Link due to error containment without prior warning to the operating system. DPC is a feature of a Downstream Port. DPC halts PCIe traffic below a downstream port after an unmasked uncorrectable error is detected at or below the port, avoiding the potential spread of any data corruption and permitting error recovery if supported by software. Among the errors that can trigger DPC, the “Surprise Down Error” is an important one; it is reported when the Downstream Link is down (for example, due to Async Removal of the attached device).

Hot plug support unit 158 further enables DPU 130 to support Switch Port Containment (SPC) by tracking non-posted request TLPs within a scaled-out PCIe Switch. SPC tracks every pending non-posted request TLP in the scaled-out PCIe Switch and, in case of link failures across the distributed Ethernet-based switch fabric or peripheral PCIe links, the DPUs are responsible for composing completion TLPs to avoid PCIe completion timeout in CPUs or other PCIe endpoints (such as GPUs).

The SPC feature may only be turned on when a non-posted request TLP and its completion TLPs take the same path (in the opposite direction). The SPC tracks non-posted requests on links among the switch ports. Since, in this disclosure, a PCIe switch is physically disaggregated and scaled out into a network, the virtual PCIe switch or PCIe proxy needs to recover gracefully when the connectivity/links within the scaled-out network break and completions never come back. This scenario does not exist for a normal PCIe switch. The SPC also performs enhanced tracking of non-posted requests sent to the local Switch Downstream Port. The SPC tracks every non-posted request towards the local Switch Downstream Port, independently of any triggered DPC. In addition, the SPC composes a completion for every pending non-posted request upon DPC being fired or when a hardware timeout occurs. In contrast, the standard DPC only requires composing completions for non-posted requests arriving after DPC is triggered. The SPC likewise composes a completion for every pending non-posted request upon link down or when a hardware timeout occurs, whereas for standard PCIe switches all TLPs are to be discarded upon link down and no completions are required to be composed.
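
A sketch of such a tracking table and its sweep on link down, DPC, or timeout follows; the table depth, entry fields, and compose-completion hook are illustrative assumptions rather than the actual SPC hardware.

    /* Sketch of Switch Port Containment (SPC) tracking: every outstanding
     * non-posted request is recorded, and a completion is composed if the
     * fabric or peripheral link fails, DPC fires, or a hardware timeout
     * expires before the real completion returns. Names are assumptions. */
    #include <stdbool.h>
    #include <stdint.h>

    #define SPC_TABLE_DEPTH 1024u

    struct spc_entry {
        bool     pending;
        uint16_t requester_id;   /* requester ID to target a composed completion */
        uint8_t  tag;            /* tag of the tracked non-posted request        */
        uint64_t deadline;       /* hardware timeout deadline for this entry     */
    };

    struct spc_table { struct spc_entry e[SPC_TABLE_DEPTH]; };

    /* Hypothetical hook that builds and injects a completion TLP (e.g., with
     * Completer Abort status) toward the original requester. */
    typedef void (*compose_completion_fn)(uint16_t requester_id, uint8_t tag);

    /* Complete every pending non-posted request so the CPU or GPU never hits
     * a PCIe completion timeout. */
    void spc_sweep(struct spc_table *t, uint64_t now, bool link_down,
                   compose_completion_fn compose)
    {
        for (uint32_t i = 0; i < SPC_TABLE_DEPTH; i++) {
            struct spc_entry *e = &t->e[i];
            if (e->pending && (link_down || now >= e->deadline)) {
                compose(e->requester_id, e->tag);
                e->pending = false;
            }
        }
    }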

When completions take different paths from their non-posted requests, the SPC feature is turned off. The PCIe access control services (ACS) feature allows a non-posted request and its related completions to go through different paths. When different paths are taken, no corresponding completion may be seen by the remote or local tracking tables. Further, for the remote tracking table, a new mechanism is disclosed to clean up the pending non-posted requests in the remote tracking table in order to replenish the reserved buffer at the destination DPU. In the new mechanism, a destination DPU sends special Read_Ack messages back to the source DPU. A Read_Ack message echoes back the index of the entry in the remote tracking table carried by the non-posted request. Along with each index, the amount of buffer to be recovered is also returned. The “destination address” to deliver the Read_Ack is expected to be prepared by the networking unit of the DPU that received the non-posted request. A Read_Ack message can carry multiple indices of the remote tracking table.
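
The following sketch shows one possible Read_Ack layout consistent with the description above; the exact wire format, maximum entry count, and field widths are assumptions for illustration only.

    /* Sketch of a Read_Ack message that returns remote-tracking-table indices
     * and the amount of reserved buffer to replenish at the source DPU. */
    #include <stdint.h>

    #define READ_ACK_MAX_ENTRIES 16u   /* one message may carry multiple indices */

    struct read_ack_entry {
        uint16_t table_index;   /* remote tracking table index echoed from the request */
        uint16_t buffer_bytes;  /* reserved buffer to replenish for that entry         */
    };

    struct read_ack_msg {
        uint16_t src_gid;       /* destination DPU sending the acknowledgement   */
        uint16_t dst_gid;       /* source DPU whose tracking table is cleaned up */
        uint8_t  n_entries;
        struct read_ack_entry entries[READ_ACK_MAX_ENTRIES];
    };

    /* At the source DPU: total buffer credit returned by one Read_Ack message. */
    uint32_t read_ack_credit(const struct read_ack_msg *m)
    {
        uint32_t total = 0;
        for (uint8_t i = 0; i < m->n_entries && i < READ_ACK_MAX_ENTRIES; i++)
            total += m->entries[i].buffer_bytes;
        return total;
    }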

Reliable transmission unit 154 further enables DPU 130 to implement solutions for deadlock avoidance. A PCIe interconnect has strict bypass rules to avoid deadlock. One key bypass is to allow posted requests and completions to bypass non-posted requests when non-posted requests cannot make forward progress. In the virtual PCIe switch architecture described herein, all TLPs of a given direction from a given source DPU switch port are delivered to a given destination DPU switch port in a single tunnel. With this single tunnel approach, for all non-posted requests submitted into the tunnel, either the completions of these non-posted requests need to have guaranteed space in the source DPU or a destination DPU needs to have guaranteed space to store aside all received non-posted requests from all possible source DPUs.

FIG. 5 is a block diagram illustrating an example of host unit 146 of DPU 130 from FIG. 4, in accordance with the techniques of this disclosure. Host unit 146 includes at least one host unit (HU) slice 160 having one or more PCIe controllers 166. In some examples, host unit 146 may include multiple, identical HU slices.

HU slice 160 has one PCIe PHY 162 shared by four PCIe controllers 166A-166D (collectively, “PCIe controllers 166”) via lane MUX 164. In one example, PHY 162 comprises a x16 PCIe PHY. The 16 serdes from PHY 162 may be shared by PCIe controllers 166 in different configurations. In one example, PCIe controller 166A may comprise a x16 PCIe controller that supports up to 16 PCIe serdes, but can also be configured as a x8 controller or a x4 controller. PCIe controller 166A may support up to 8 Physical Functions (PFs) of PCIe links. PCIe controller 166B may comprise a x8 PCIe controller that supports up to 8 PCIe serdes. PCIe controller 166B may also support up to 8 PFs of PCIe links. PCIe controllers 166C and 166D may comprise x4 PCIe controllers that each support up to 4 PCIe serdes. In examples where a x16 PCIe controller is assigned a full 16 serdes, it may only support up to PCIe Gen3 speed. In examples where a PCIe controller is assigned 8, 4, 2, or 1 serdes, it may support all the speeds from PCIe Gen1 through Gen4.

As illustrated in FIG. 5, PCIe controllers 166 include mode units (MUs) 147A-147D (collectively, “mode units 147”). Each of PCIe controllers 166 may operate in one of Endpoint (EP) or Rootport (RP) mode for NVMe or other known protocol applications, or Switch Upstream (UP) or Switch Downstream (DN) mode for unknown protocol applications, to implement the PCIe proxy described herein. The operational mode for each of PCIe controllers 166 may be set by host unit 146 or, in some examples, PCIe proxy logic 148 of DPU 130. The PCIe interfaces or ports exposed by each of PCIe controllers 166 operate according to the operational mode of the respective PCIe controller. The existence of multiple PCIe controllers 166, however, enables multiple PCIe ports operating in different modes and/or multiple functions of a single PCIe port operating in different modes to co-exist on the same DPU 130 and belong to the same PCIe proxy. For example, a single PCIe port may include two PCIe functions or branches: one function operating in a first mode (e.g., switch UP function mode) used to physically connect a PCIe host device to a physical PCIe endpoint device using PCIe over the network fabric, and the other function operating in a second mode (e.g., EP function mode) used to logically connect the PCIe host device to a virtual PCIe device implemented by DPU 130 as an abstraction of one or more physical PCIe endpoint devices that are locally or remotely attached to DPU 130.
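
As a simple illustration, the four operational modes and a two-function port might be modeled as follows; the enumeration and structure names are assumptions and do not reflect the actual mode unit registers.

    /* Sketch of per-controller operational modes and a two-function port;
     * names are illustrative assumptions, not the hardware interface. */
    enum hu_port_mode {
        MODE_EP,        /* Endpoint: known application protocols, e.g., NVMe         */
        MODE_RP,        /* Rootport: known application protocols toward SSDs         */
        MODE_SWITCH_UP, /* Switch Upstream: unknown protocols, faces the PCIe host   */
        MODE_SWITCH_DN  /* Switch Downstream: unknown protocols, faces PCIe devices  */
    };

    struct hu_function {
        unsigned          function_number; /* PCIe function within the port      */
        enum hu_port_mode mode;            /* set via the controller's mode unit */
    };

    /* One physical PCIe port exposing two functions in different modes, e.g.,
     * a switch UP function and an EP function co-existing in the same proxy. */
    struct hu_port {
        struct hu_function fn[2];
    };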

A PCIe switch typically has at most 33 ports: 1 upstream port and at most 32 downstream ports. In some examples, the only upstream port is assigned port_number 32 regardless of how many downstream ports exist, and the 32 downstream ports are assigned port_number 0 through 31, respectively. TLPs can be exchanged between any two ports in a PCIe Switch. The upstream port can talk to all 32 downstream ports bi-directionally. A downstream port can talk to all 33 ports (including itself) bi-directionally. When a PCIe switch is scaled out into a distributed Ethernet interconnect, as described herein, logical tunnels are established between DPUs to allow switch ports located in different DPUs to communicate as if the switch ports were still in a single PCIe switch component. A tunnel is created per source switch port, per destination switch port, per direction. That is, for a switch port source-destination pair, two tunnels are created for bi-directional communication.

In some examples, for the scaled-out PCIe switch, a global identifier (GID) space may be shared by all the DPUs in a tunnel domain such that a tunnel is a {src_GID, dst_GID} pair. The tunnels may be used for both PCIe switch scale-out as well as other infrastructure mechanisms. As one example, DPU 130 may support 16*1024 tunnels per direction.
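
A sketch of tunnel identification under these examples follows; the port numbering constant and helper functions are illustrative assumptions only.

    /* Sketch of tunnel identification in the scaled-out switch: one tunnel per
     * {source switch port, destination switch port, direction}, keyed by a
     * {src_GID, dst_GID} pair from a GID space shared across the tunnel domain. */
    #include <stdint.h>

    #define UPSTREAM_PORT_NUMBER 32u   /* downstream ports are numbered 0 through 31 */

    struct tunnel_key {
        uint16_t src_gid;   /* global identifier of the source switch port      */
        uint16_t dst_gid;   /* global identifier of the destination switch port */
    };

    /* For a switch-port source-destination pair, two tunnels (one per
     * direction) provide bi-directional communication. */
    static inline struct tunnel_key forward_tunnel(uint16_t src_gid, uint16_t dst_gid)
    {
        struct tunnel_key k = { src_gid, dst_gid };
        return k;
    }

    static inline struct tunnel_key reverse_tunnel(uint16_t src_gid, uint16_t dst_gid)
    {
        struct tunnel_key k = { dst_gid, src_gid };
        return k;
    }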

Within HU slice 160, the flow index may be organized as {destination_port_number_in_switch, source_controller_ID_in_slice}. DPU 130 has all addressing information for a controller: it uses destination_port_number_in_switch to find out where the peer ports are located, then programs lookup tables to set up the needed tunnels accordingly. The encapsulated TLP bytes are sent over to the destination DPUs via these tunnels.

When an encapsulated packet arrives at a destination DPU, the networking unit of that DPU uses the tunnel information to go through a content-addressable memory (CAM) structure to retrieve the following information: destination_slice_ID, destination_controller_ID, and source_port_number. The destination_slice_ID is needed to direct the packet (i.e., a WU) to the right port associated with the destination HU Slice. The destination_controller_ID and source_port_number are used by the destination HU Slice to compose the flow index for delivering a read acknowledgement message.
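
The flow index composition and the destination-side lookup might be sketched as follows; the bit layout of the flow index and the cam_lookup signature are assumptions chosen only to mirror the fields named above, and the CAM itself is hardware that is merely modeled here.

    /* Sketch of the flow index used within an HU slice and the CAM lookup the
     * destination networking unit performs on tunnel information. */
    #include <stdbool.h>
    #include <stdint.h>

    /* Flow index: {destination_port_number_in_switch, source_controller_ID_in_slice}.
     * The bit layout is an assumption: 2 bits cover the four controllers in an
     * HU slice, and 6 bits cover port numbers 0 through 32. */
    static inline uint16_t make_flow_index(uint8_t dest_port_number_in_switch,
                                           uint8_t source_controller_id_in_slice)
    {
        return (uint16_t)((dest_port_number_in_switch << 2) |
                          (source_controller_id_in_slice & 0x3u));
    }

    /* Result of the destination-side lookup keyed on the tunnel's {src_GID, dst_GID}. */
    struct cam_result {
        uint8_t destination_slice_id;      /* directs the WU to the destination HU slice */
        uint8_t destination_controller_id; /* controller within that slice               */
        uint8_t source_port_number;        /* used to compose the read-ack flow index    */
    };

    /* Hypothetical lookup; returns false on a miss. */
    bool cam_lookup(uint16_t src_gid, uint16_t dst_gid, struct cam_result *out);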

FIG. 6 is a flow diagram illustrating an example operation for converting between PCIe and Ethernet in a data processing unit, in accordance with the techniques of this disclosure. A first example operation of FIG. 6 is described with respect to PCIe proxy 23 of FIG. 2A having DPU 27D as the first DPU and DPU 27A as the second DPU, and CPU 24 as the PCIe host device and GPU 26A as the PCIe endpoint device.

First DPU 27D implements PCIe proxy logic that configures host unit interface 25D of first DPU 27D to operate in a first mode, i.e., a switch upstream (UP) function mode, for the PCIe connection to CPU 24 (200). When configured to operate in the switch UP function mode, host unit interface 25D provides CPU 24 access to PCIe proxy 23 operating as a virtual switch attached to GPUs 26. Second DPU 27A also implements PCIe proxy logic that configures host unit interface 25A of second DPU 27A to operate in a second mode, i.e., a switch downstream (DN) function mode, for the PCIe connection to GPU 26A (202). When configured to operate in the switch DN function mode, host unit interface 25A provides GPU 26A access to PCIe proxy 23 operating as a virtual switch attached to CPU 24.

First DPU 27D receives PCIe packets from CPU 24 on host unit interface 25D (204). First DPU 27D determines that the received PCIe packets are destined for GPU 26A, which is locally attached to second DPU 27A interconnected to first DPU 27D via network fabric 14A. First DPU 27D establishes a logical tunnel across network fabric 14A between first DPU 27D and the second DPU 27A (206). First DPU 27D then encapsulates the PCIe packets using a transport protocol over IP over Ethernet encapsulation (208). First DPU 27D then sends the encapsulated packets over the logical tunnel to second DPU 27A (210).

Second DPU 27A receives the encapsulated packets over the logical tunnel from first DPU 27D (212). Second DPU 27A extracts the PCIe packets and sends the PCIe packets on host unit interface 25A to GPU 26A (214). These operations are transparent to CPU 24 and GPU 26A and appear, from the perspective of CPU 24 and GPU 26A, to be performed by a locally attached PCIe switch.
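
The first example operation, viewed from the source DPU, can be summarized in the following sketch; every function here is a hypothetical placeholder keyed to the numbered steps of FIG. 6, not an actual DPU API.

    /* Source-DPU handling of PCIe packets received from the host (204):
     * tunnel, encapsulate, and send toward the DPU fronting the endpoint. */
    #include <stddef.h>
    #include <stdint.h>

    struct pcie_pkt { const uint8_t *data; size_t len; };
    struct tunnel   { uint16_t src_gid, dst_gid; };

    /* Hypothetical placeholders for the behavior described in the text. */
    struct tunnel establish_tunnel(uint16_t local_gid, uint16_t remote_gid);         /* (206) */
    size_t        encapsulate(const struct pcie_pkt *p, uint8_t *frame, size_t cap); /* (208) */
    void          send_over_fabric(const struct tunnel *t,
                                   const uint8_t *frame, size_t len);                /* (210) */

    void source_dpu_forward(const struct pcie_pkt *from_host,
                            uint16_t local_gid, uint16_t remote_gid)
    {
        uint8_t frame[1500];
        struct tunnel t = establish_tunnel(local_gid, remote_gid);
        size_t n = encapsulate(from_host, frame, sizeof(frame));
        send_over_fabric(&t, frame, n);
    }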

A second example operation of FIG. 6 is described with respect to PCIe proxy 33 of FIG. 2B having DPU 37A as the first DPU and DPU 37D as the second DPU, and CPU 30 as the PCIe host device and SSD 36 as the PCIe endpoint device. In this example, PCIe proxy 33 operates as a virtual SSD that appears, from the perspective of CPU 30, to be a locally attached PCIe endpoint device. DPU 37A may implement the virtual SSD as an abstraction of one or more physical PCIe endpoint devices that are locally and/or remotely attached to DPU 37A, e.g., SSD 36 attached to DPU 37D of PCIe proxy 33.

In the example of FIG. 2B, host unit interface 35A includes at least two PCIe functions, with a first function operating in the switch UP mode and the second function operating in the EP mode. In the case where CPU 30 selects the EP function mode, first DPU 37A implements PCIe proxy logic that configures host unit interface 35A of first DPU 37A to operate in a first mode, i.e., the EP function mode, for the PCIe connection to CPU 30 (200). When configured to operate in the EP function mode, host unit interface 35A provides CPU 30 access to PCIe proxy 33 operating as a virtual SSD implemented as an abstraction of SSD 36. Second DPU 37D also implements PCIe proxy logic that configures host unit interface 35D of second DPU 37D to operate in a second mode, i.e., an RP function mode, for the PCIe connection to SSD 36 (202).

First DPU 37A receives PCIe packets from CPU 30 on host unit interface 35A (204) destined for the locally attached virtual SSD implemented by first DPU 37A as an abstraction of remotely attached SSD 36. First DPU 37A determines that the received PCIe packets are destined for SSD 36, which is communicatively coupled to second DPU 37D interconnected to first DPU 37A via network fabric 14B. First DPU 37A establishes a logical tunnel across network fabric 14B between first DPU 37A and the second DPU 37D (206). First DPU 37A then encapsulates the PCIe packets using a transport protocol over IP over Ethernet encapsulation (208). First DPU 37A then sends the encapsulated packets over the logical tunnel to second DPU 37D (210).

Second DPU 37D receives the encapsulated packets over the logical tunnel from first DPU 37A (212). Second DPU 37D extracts the PCIe packets and sends the PCIe packets on host unit interface 35D to SSD 36 (214). These operations are transparent to CPU 30 and appear, from the perspective of CPU 30, to be performed by a locally attached virtual SSD.

For processes, apparatuses, and other examples or illustrations described herein, including in any flowcharts or flow diagrams, certain operations, acts, steps, or events included in any of the techniques described herein can be performed in a different sequence, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the techniques). Moreover, in certain examples, operations, acts, steps, or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially. Further, certain operations, acts, steps, or events may be performed automatically even if not specifically identified as being performed automatically. Also, certain operations, acts, steps, or events described as being performed automatically may alternatively not be performed automatically, but rather, such operations, acts, steps, or events may be, in some examples, performed in response to input or another event.

The detailed description set forth above is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of the various concepts. However, these concepts may be practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form in the referenced figures in order to avoid obscuring such concepts.

In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored, as one or more instructions or code, on and/or transmitted over a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (e.g., pursuant to a communication protocol). In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media, which is non-transitory, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.

By way of example, and not limitation, such computer-readable storage media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the terms “processor” or “processing circuitry” as used herein may each refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described. In addition, in some examples, the functionality described may be provided within dedicated hardware and/or software modules. Also, the techniques could be fully implemented in one or more circuits or logic elements.

The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, a mobile or non-mobile computing device, a wearable or non-wearable computing device, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a hardware unit or provided by a collection of interoperating hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.

Various examples have been described. These and other examples are within the scope of the following claims.

What is claimed is:
1. A network system comprising: a plurality of data processing units (DPUs) interconnected via a network fabric, wherein each DPU of the plurality of DPUs implements proxy logic for a system bus connection, and wherein the plurality of DPUs and the network fabric together operate as a single system bus connection proxy; a host device locally attached to a host unit interface of a first DPU of the plurality of DPUs via a first system bus connection; and a plurality of endpoint devices locally attached to host unit interfaces of one or more second DPUs of the plurality of DPUs via second system bus connections, wherein the proxy logic for the system bus connection supports graceful linking and unlinking of the plurality of endpoint devices, wherein the first DPU is configured to, upon receipt of packets from the host device on the host unit interface of the first DPU and destined for a given endpoint device of the plurality of endpoint devices, establish a logical tunnel across the network fabric between the first DPU and one of the second DPUs to which the given endpoint device is locally attached, encapsulate the packets using a transport protocol, and send the encapsulated packets over the logical tunnel to the one of the second DPUs, and wherein the one of the second DPUs is configured to, upon receipt of the encapsulated packets, extract the packets, and send the packets on a host unit interface of the one of the second DPUs to the given endpoint device.
2. The network system of claim 1, wherein, to encapsulate the packets using the transport protocol, the first DPU is configured to encapsulate the packets using a transport protocol over IP over Ethernet encapsulation.
3. The network system of claim 1, wherein the encapsulated packets are layer 3 (L3) routable within the logical tunnel across the network fabric.
4. The network system of claim 1, wherein, to encapsulate the packets using the transport protocol, the first DPU is configured to pack multiple system bus connection transaction layer packets (TLPs) into an Ethernet frame.
5. The network system of claim 1, wherein the host unit interface of the first DPU is configured to provide the host device access to the system bus connection proxy operating as at least one of a virtual switch attached to one or more of the endpoint devices or a virtual device implemented as an abstraction of one or more of the endpoint devices.
6. The network system of claim 1, wherein the host unit interface of the first DPU is configured to operate in a switch upstream function mode for the first system bus connection to provide the host device access to the system bus connection proxy operating as a virtual switch attached to one or more of the endpoint devices; and wherein the host unit interface of one of the second DPUs is configured to operate in a switch downstream function mode for a second system bus connection to provide the given endpoint device access to the system bus connection proxy operating as the virtual switch attached to the host device.
7. The network system of claim 6, wherein, from the perspective of the host device and the given endpoint device, the system bus connection proxy appears to be a locally attached system bus connection switch.
8. The network system of claim 1, wherein the host unit interface of the first DPU is configured to operate in an endpoint function mode for the first system bus connection to provide the host device access to the system bus connection proxy operating as a virtual device implemented as an abstraction of one or more endpoint devices; and wherein the host unit interface of the one of the second DPUs is configured to operate in a rootport function mode for a second system bus connection to the given endpoint device.
9. The network system of claim 8, wherein, from the perspective of the host device, the system bus connection proxy appears to be a locally attached endpoint device.
10. The network system of claim 1, wherein the host device comprises a central processing unit (CPU), and wherein the given endpoint device comprises a graphics processing unit (GPU) allocated and provisioned to the CPU from a pool of GPUs located anywhere in the network system.
11. The network system of claim 1, wherein the host unit interface of the first DPU exposes a logical system bus connection proxy model to the host device, and wherein the host unit interfaces of the second DPUs each expose a logical system bus connection proxy model to the plurality of endpoint devices.
12. The network system of claim 1, wherein the proxy logic for the system bus connection configures each of the host unit interface of the first DPU and the host unit interfaces of the second DPUs to operate in one or more of an endpoint mode, a rootport mode, a switch upstream function mode, or a switch downstream function mode.
13. The network system of claim 12, wherein the host device and the given endpoint device comprise application processors, wherein the proxy logic for the system bus connection implemented on the first DPU configures the host unit interface of the first DPU to operate in the switch upstream function mode to communicate with the host device; and wherein the proxy logic for the system bus connection implemented on the one of the second DPUs configures the host unit interface of the one of the second DPUs to operate in the switch downstream function mode to communicate with the given endpoint device.
14. The network system of claim 12, wherein the host device comprises an application processor and the given endpoint device comprises a storage device, wherein the proxy logic for the system bus connection implemented on the first DPU configures the host unit interface of the first DPU to operate in the endpoint mode to communicate with the host device; and wherein the proxy logic for the system bus connection implemented on the one of the second DPUs configures the host unit interface of the one of the second DPUs to operate in the rootport mode to communicate with the given endpoint device.
15. The network system of claim 1, wherein the proxy logic for the system bus connection supports reliable transmission of the encapsulated packets using the transport protocol by maintaining Peripheral Component Interconnect Express (PCIe) ordering and deadlock prevention solutions.
16. The network system of claim 1, wherein the proxy logic for the system bus connection supports security within the network fabric, and wherein the logical tunnel established by the first DPU comprises an encrypted tunnel.
17. A first data processing unit (DPU) integrated circuit comprising: a networking unit interconnected with a plurality of DPUs via a network fabric; a host unit comprising a host unit interface locally attached to a host device via a system bus connection; and at least one processing core configured to: execute proxy logic for a system bus connection, wherein the plurality of DPUs, including the first DPU integrated circuit, and the network fabric together operate as a single system bus connection proxy, and wherein the host unit interface is configured to provide access to the single system bus connection proxy operating as at least one of a virtual switch attached to one or more of a plurality of endpoint devices or a virtual device implemented as an abstraction of one or more of the plurality of endpoint devices, wherein the proxy logic for the system bus connection supports graceful linking and unlinking of the plurality of endpoint devices, and upon receipt of packets from the host device on the host unit interface and destined for a given endpoint device of the plurality of endpoint devices, establish a logical tunnel across the network fabric between the first DPU integrated circuit and a second DPU integrated circuit of the plurality of DPUs to which the given endpoint device is locally attached, encapsulate the packets using a transport protocol, and send the encapsulated packets over the logical tunnel to the second DPU integrated circuit.
18. The first DPU of claim 17, wherein, to encapsulate the packets using the transport protocol, the processing core is configured to pack multiple system bus connection transaction layer packets (TLPs) into an Ethernet frame.
19. A method comprising: configuring, by a first data processing unit (DPU) of a plurality of DPUs interconnected via a network fabric and implementing proxy logic for a system bus connection, a host unit interface of the first DPU to operate in a first mode for a system bus connection by which the host unit interface is locally attached to a host device, wherein the plurality of DPUs and the network fabric together operate as a single system bus connection proxy, and wherein the host unit interface of the first DPU is configured to provide access to the single system bus connection proxy operating as at least one of a virtual switch attached to one or more of a plurality of endpoint devices or as a virtual device implemented as an abstraction of one or more of the plurality of endpoint devices, wherein the proxy logic for the system bus connection supports graceful linking and unlinking of the plurality of endpoint devices; receiving, on the host unit interface of the first DPU, packets from the host device on the host unit interface, wherein the packets are destined for a given endpoint device of the plurality of endpoint devices; establishing a logical tunnel across the network fabric between the first DPU and a second DPU of the plurality of DPUs to which the given endpoint device is locally attached; encapsulating the packets using a transport protocol; and sending the encapsulated packets over the logical tunnel to the second DPU.
20. The method of claim 19, further comprising: configuring, by the second DPU, a host unit interface of the second DPU to operate in a second mode for a system bus connection by which the host unit interface of the second DPU is communicatively coupled to the given endpoint device; and upon receipt of the encapsulated packets, extracting, by the second DPU, the packets and sending, on the host unit interface of the second DPU, the packets to the given endpoint device.